Skip to content

feat: add kep-98:ResourceScalingGroup (RSG) - Advanced Scaling API fo…#120

Open
Mag-FelixFelicis wants to merge 1 commit into
sgl-project:mainfrom
Mag-FelixFelicis:kep_98
Open

feat: add kep-98:ResourceScalingGroup (RSG) - Advanced Scaling API fo…#120
Mag-FelixFelicis wants to merge 1 commit into
sgl-project:mainfrom
Mag-FelixFelicis:kep_98

Conversation

@Mag-FelixFelicis

Copy link
Copy Markdown

…r Distributed AI Inference

Ⅰ. Motivation

Ⅱ. Modifications

Ⅲ. Does this pull request fix one issue?

fixes #XXXX

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@cheyang cheyang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this KEP together. The motivation around multi-role coupled scaling and non-intrusive adoption of existing workloads is compelling. A few things I'd like to see addressed before this is ready:

API Group & Project Alignment

  • The proposed resourcescalinggroup.com/v1 API group doesn't align with the project's existing convention (workloads.x-k8s.io). Is there a reason RSG lives in a separate API group? If this is intended to be part of the RBG ecosystem, it should probably be under the same group.
  • The relationship between RSG and RBGSA (RoleBasedGroupScalingAdapter) needs to be spelled out explicitly. Does RSG supersede RBGSA for these use cases, or do they coexist? What's the migration path?

Type Definition Inconsistencies

  • Ratio is declared as int32 in the Go struct, but the comment says "Using string to support decimals safely in JSON (parsed as float64 in controller)." The YAML examples use 1.0 and 2.0. Pick one representation and make it consistent. If you need fractional ratios (e.g., 0.25), int32 won't work.
  • In resourcescalinggroup.yaml, the second example has roleName: "Decode" (capitalized) but the KEP body uses roleName: "decode" (lowercase). Which is canonical?

Underspecified Mechanisms

  • HPA integration: Status has a LabelSelector field "used by HPA to discover the resource," implying RSG implements the Scale subresource. This is a critical integration point that deserves its own section — how does HPA target RSG? What does the /scale endpoint look like?
  • GroupReplication scale-out trigger: Who/what increments replicas? If HPA drives it, the Scale subresource contract is essential. If it's manual, say so explicitly.
  • Name resolution in InplaceScaling bindings: In the second example, bindings reference prefill and decode by name, but those are subResources of rolebasedgroup1, not top-level entries in targetResources. How does the controller resolve this? The indirection needs documenting.
  • Scale-down candidate discovery: How are group IDs (e.g., inference-unit-02) assigned, tracked, and communicated to external systems like traffic gateways? Without this, the "zero-downtime" story is incomplete.

Missing Sections

  • Test Plan (Unit, Integration, E2E) — all empty. At minimum, outline what the key test scenarios are.
  • Alternatives — what other approaches were considered? Why not extend RBGSA instead of introducing a new CRD?
  • Graduation criteria and timeline are absent.

Deletion Semantics

  • Orphan handling only covers InplaceScaling (Retain). What happens in GroupReplication mode when RSG is deleted — are cloned resources garbage collected via OwnerReferences, or also retained?

Minor

  • Both files are missing a trailing newline.
  • The KEP file has spaces in its name (KEP-98 ResourceScalingGroup...), which can cause issues with some tooling. Consider using hyphens.
  • The webhook validation for circular dependency prevention is mentioned but not specified. At least outline the algorithm (topological sort on the binding graph).

Overall the design has legs, but there's enough underspecification that it's hard to evaluate feasibility or review an implementation against this. Looking forward to a revision that fills in the gaps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants